1 Project Description

1.1 General Overview

  • CAMBRIDGE GARMENTS Industry (CGI) is a leading fast fashion retailer in Pakistan, that deals in traditional and western clothing styles.

  • The company has a presence in every major city of Pakistan, operating most of the stores by itself, which goes by the name The Cambridge Shop (TCS).

  • CGI has captured the pulse of white-collar Pakistani men, who generally like to wear quality cotton shirts, with its craftsmanship and uncompromising quality.

1.2 R Markdown


2 Data Overview

  • We got raw sales data for CGI Men’s shirt, which is typically used in the Supply Chain department for sales tracking and stock movement analysis.

  • The data has multiple numerical and many categorical variables.

  • The data is invoice influenced, where each record represents a portion of the transaction or a value of a particular shirt brand sold.

  • To make use of this data for any Exploratory Data Analysis (EDA) and Machine Learning (ML), it is necessary to clean the data and adapt it according to our requirements.

  • The initial dataset contains 46 columns and 670,000 records, where each record is identified by a “BillNo”.

  • Each record contains information about the sold out shirt from TCS related to a particular “cobrand”, which results in multiple record of similar “BillNo” but of different “cobrands”.

If you think our project is boring look at this:

Alt Text

  • With clean data, we compare the location of TCS and prepare the location data of 68 positions with the help of Google Maps. This data is further used to present locations on maps.

3 Flow chart of the project

Steps followed in the project


4 Exploratory Data Analysis (EDA)

  • It is important to name variables according to the data they hold and eliminates columns with similar data. This is perhaps the most important thing when one wrangles the actual data.

  • EDA includes the renaming of variables, eliminating variables, and comparing them to check whether they represent the data accordingly, through various graphs and charts.

  • Lastly, it is important to check the summary statistics of the data.

4.1 Step1 - Import modules


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.impute import SimpleImputer

4.2 Step2: Import data

df1=pd.read_excel('/Users/snawaz/Documents/pychilla2/teamproject_sep3/Deep_note_linked/Sales.xlsx')

4.3 Step3: Check shape of Data

df=df1.copy()
rows,cols=df.shape
print("Number of rows in dataset are",rows)
## Number of rows in dataset are 670082
print("Number of columns in the dataset are",cols)
## Number of columns in the dataset are 45

4.4 Step 4: Check the data types of the columns

df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 670082 entries, 0 to 670081
## Data columns (total 45 columns):
##  #   Column             Non-Null Count   Dtype         
## ---  ------             --------------   -----         
##  0   BillNo             670082 non-null  object        
##  1   BillDate           670082 non-null  datetime64[ns]
##  2   LoyaltyCard        33720 non-null   object        
##  3   Customer           661558 non-null  object        
##  4   Description        66014 non-null   object        
##  5   BillMonth          670082 non-null  object        
##  6   Warehouse          670082 non-null  object        
##  7   RegionName         670082 non-null  object        
##  8   Location           670082 non-null  object        
##  9   Category           670082 non-null  object        
##  10  DepartmentName     670082 non-null  object        
##  11  BrandName          668530 non-null  object        
##  12  CoBrand            670082 non-null  object        
##  13  Barcode            670082 non-null  int64         
##  14  DesignNo           670082 non-null  object        
##  15  Rejection          670082 non-null  object        
##  16  SeasonName         670082 non-null  object        
##  17  Attribute1         670057 non-null  object        
##  18  Attribute2         143125 non-null  object        
##  19  Attribute3         309262 non-null  object        
##  20  Attribute4         670082 non-null  object        
##  21  Attribute5         337322 non-null  object        
##  22  Attribute6         0 non-null       float64       
##  23  Attribute7         0 non-null       float64       
##  24  Attribute8         490234 non-null  object        
##  25  LocalImport        670082 non-null  object        
##  26  Color              670082 non-null  object        
##  27  Sizes              670082 non-null  object        
##  28  DiscountType       502886 non-null  object        
##  29  SalesmanName       670082 non-null  object        
##  30  Qty                670082 non-null  int64         
##  31  SalesReturnReason  14786 non-null   object        
##  32  Price              670082 non-null  int64         
##  33  Amount             670082 non-null  int64         
##  34  SaleExclGST        670082 non-null  float64       
##  35  GSTP               670082 non-null  int64         
##  36  GST                670082 non-null  int64         
##  37  DiscPer            670082 non-null  float64       
##  38  DiscAmount         670082 non-null  float64       
##  39  BarcodeDiscPer     670082 non-null  int64         
##  40  BarcodeDiscount    670082 non-null  int64         
##  41  NetAmount          670082 non-null  float64       
##  42  PointsEarned       670082 non-null  int64         
##  43  TaxPer             670082 non-null  int64         
##  44  Cobrand Acc        670082 non-null  object        
## dtypes: datetime64[ns](1), float64(6), int64(10), object(28)
## memory usage: 230.1+ MB

4.5 Step 5: Check the first 5 rows of the data

DT::datatable(head(py$df,20), options = list(pageLength = 5,scrollX=T))

4.6 Step 6: Check the null values of the data

Since there are either repetitive values or nans so we are dropping Attribute 4 column

##                    percent_missing
## BillNo                    0.000000
## BillDate                  0.000000
## LoyaltyCard              94.967780
## Customer                  1.272083
## Description              90.148370
## BillMonth                 0.000000
## Warehouse                 0.000000
## RegionName                0.000000
## Location                  0.000000
## Category                  0.000000
## DepartmentName            0.000000
## BrandName                 0.231613
## CoBrand                   0.000000
## Barcode                   0.000000
## DesignNo                  0.000000
## Rejection                 0.000000
## SeasonName                0.000000
## Attribute1                0.003731
## Attribute2               78.640674
## Attribute3               53.847141
## Attribute4                0.000000
## Attribute5               49.659594
## Attribute6              100.000000
## Attribute7              100.000000
## Attribute8               26.839700
## LocalImport               0.000000
## Color                     0.000000
## Sizes                     0.000000
## DiscountType             24.951573
## SalesmanName              0.000000
## Qty                       0.000000
## SalesReturnReason        97.793404
## Price                     0.000000
## Amount                    0.000000
## SaleExclGST               0.000000
## GSTP                      0.000000
## GST                       0.000000
## DiscPer                   0.000000
## DiscAmount                0.000000
## BarcodeDiscPer            0.000000
## BarcodeDiscount           0.000000
## NetAmount                 0.000000
## PointsEarned              0.000000
## TaxPer                    0.000000
## Cobrand Acc               0.000000
msno.matrix(df)

Columns Attribute 6 and 7 have 100% missing values. There is no description available for Attribute5 column so all of these columns are dropped.

df.drop(['Attribute6','Attribute7'],axis=1,inplace=True)

4.7 Step 9: Dropping useless columns

df.Attribute4.unique()
## array(['CAMBRIDGE'], dtype=object)

Column Salemanname is a useless feature since we are not interested to identify the sales by each person to avoid bias in decision making.


df.drop('Attribute4',axis=1,inplace=True)
df.drop('SalesmanName',axis=1,inplace=True)
df.drop('Category',axis=1,inplace=True)

The data between columns CoBrand and CoBrand Acc is 99% similar so we are keeping only column out of it.

from fuzzywuzzy import fuzz

fuzz.token_sort_ratio(df['CoBrand'], df['Cobrand Acc'])
df.drop('CoBrand',axis=1,inplace=True)
df.drop('Barcode',axis=1,inplace=True)

4.8 Data Preprocessing

4.8.1 Renaming some columns

Based on the data description, we are renaming some columns to make it more readable

  • Attribute1 is the Inventory_status which shows the cloth which come back to ware house and go back for sales
  • Attribute2 is the offer ctypes at the shops.
  • Attribute8 is the type of the cloth.
  • Attribute3 is the import from which cloth is imported.
df = df.rename(columns={'Attribute1':'Inventory_status',
'Attribute2':'Offers','Attribute8':'Class_of_cloth',
'Attribute3':'Import_type'})

4.8.2 For customer column

  • Using returning and non-returning customers.

  • We identify the customers by names and assume that customers with identical names is actually a returning customer.

  • We will look in graphs if the returning customers are more profitable than the non-returning customers.

df['Customer']=df.Customer.duplicated()
df['Customer'].replace(True,'Returning_Costumer',inplace=True) #<<
df['Customer'].replace(False,'NoN_Returning_Costumer',inplace=True) #<<
df.Customer.value_counts()
## Returning_Costumer        596405
## NoN_Returning_Costumer     73677
## Name: Customer, dtype: int64
  • For loyalty card column. We replaced NaNs with No and Yes for the rest of the values.
  • LoyaltyCard is the actually card offered to customers to get discounts and offers.
df['LoyaltyCard']=df.LoyaltyCard.duplicated()
df['LoyaltyCard'].replace(True,'Yes',inplace=True)
df['LoyaltyCard'].replace(False,'No',inplace=True)

4.8.3 Column Imputing by sklearn Imputing

We used the sklearn imputer to impute the missing values in the data. We used the most frequent value to impute the missing values in the data.

imp = SimpleImputer(strategy="most_frequent")
df['Inventory_status']=imp.fit_transform(df[['Inventory_status']])

imp = SimpleImputer(strategy="most_frequent") #<<
df['Import_type']=imp.fit_transform(df[['Import_type']])

df['Offers']=imp.fit_transform(df[['Offers']])
df['Attribute5']=imp.fit_transform(df[['Attribute5']])
df['BrandName']=imp.fit_transform(df[['BrandName']])
df['Description']=imp.fit_transform(df[['Description']])
df['Inventory_status']=imp.fit_transform(df[['Inventory_status']])

df.SalesReturnReason = df.SalesReturnReason.replace(np.nan,'No information available')
df.Class_of_cloth = df.Class_of_cloth.replace(np.nan, "No information") # replacing Nan values with 0 value

df.DiscountType = df.DiscountType.replace(np.nan, 'No Discount') # replacing Nan values with Most  frequent value

4.9 Summary of the data

df.describe().T
##                     count         mean          std  ...     50%     75%       max
## Qty              670082.0     0.899518     0.611418  ...     1.0     1.0      45.0
## Price            670082.0  2437.725566   714.776793  ...  2259.0  2376.0    6127.0
## Amount           670082.0  2173.785338  1621.785795  ...  2259.0  2376.0  106920.0
## SaleExclGST      670082.0  1740.625980  1489.339984  ...  1663.0  2186.0   77455.0
## GSTP             670082.0     6.092880     2.285465  ...     6.0     6.0      17.0
## GST              670082.0   109.105263   120.435080  ...    95.0   125.0    4647.0
## DiscPer          670082.0     0.382138     3.200854  ...     0.0     0.0     100.0
## DiscAmount       670082.0     8.756731    95.321598  ...     0.0     0.0   14830.0
## BarcodeDiscPer   670082.0    17.706990    20.796915  ...    20.0    30.0      60.0
## BarcodeDiscount  670082.0   423.838609   536.809144  ...   484.0   713.0   32085.0
## NetAmount        670082.0  1849.731244  1581.756942  ...  1747.0  2295.0   82102.0
## PointsEarned     670082.0     1.078650     6.030229  ...     0.0     0.0     821.0
## TaxPer           670082.0     6.359038     0.973759  ...     6.0     6.0       9.0
## 
## [13 rows x 8 columns]

5 Data Transformation

Next, we checked the Pearson correlation and visualized it through a heat map. This shows which variables are highly correlated and in this case, “Qty” is highly correlated with “NetAmount”, “SalesExclGST” and “GST”.

Now we will treat the data with outliers and apply some statistical tests

There are many outliers in numerical columns which can be shown by boxplot. We will use the IQR method to remove the outliers.


6 Data Transformation

6.0.1 Checking skewness of the data

# Library for skewness
from scipy.stats import skew
print('Skew of Price',skew(df.Price))
## Skew of Price 1.7332960811793
print('Skew of Amount',skew(df.Amount))
## Skew of Amount 0.5264229479147174
print('Skew of SaleExclGST',skew(df.SaleExclGST))
## Skew of SaleExclGST 0.053907402377267245
print('Skew of GSTP',skew(df.GSTP))
## Skew of GSTP 3.422663381156023
print('Skew of GST',skew(df.GST))
## Skew of GST 1.8886742450354679
print('Skew of BarcodeDiscPer',skew(df.BarcodeDiscPer))
## Skew of BarcodeDiscPer -0.11344278021015818
print('Skew of BarcodeDiscount',skew(df.BarcodeDiscount))
## Skew of BarcodeDiscount 1.639017468026673
print('Skew of NetAmount',skew(df.NetAmount))
## Skew of NetAmount 0.11210500286062883
print('Skew of PointsEarned',skew(df.PointsEarned))
## Skew of PointsEarned 21.43903163597488


Now by imputing right skewed by median and left skewed ‘Price’ by mean we will remove the outliers.

df['Qty'] = impute_outliers_IQR_with_median(df['Qty'])
df['Price'] = impute_outliers_IQR_with_mean(df['Price'])
df['Amount'] = impute_outliers_IQR_with_median(df['Amount']) #<<
df['SaleExclGST'] = impute_outliers_IQR_with_median(df['SaleExclGST'])
df['GSTP'] = impute_outliers_IQR_with_median(df['GSTP'])
df['GST'] = impute_outliers_IQR_with_median(df['GST'])
df['BarcodeDiscPer'] = impute_outliers_IQR_with_mean(df['BarcodeDiscPer'])
df['BarcodeDiscount'] = impute_outliers_IQR_with_median(df['BarcodeDiscount'])
df['NetAmount'] = impute_outliers_IQR_with_mean(df['NetAmount'])
df['PointsEarned'] = impute_outliers_IQR_with_median(df['PointsEarned'])
df['DiscPer'] = impute_outliers_IQR_with_median(df['DiscPer'])
df['DiscAmount'] = impute_outliers_IQR_with_median(df['DiscAmount'])

After outliers removal in numerical variables


We can apply the shapiro wilk test to check if the data is normal.

# Shapiro wilk test
from scipy.stats import shapiro #<<
print('ShapiroTest of Price',shapiro(df.Price))
## ShapiroTest of Price ShapiroResult(statistic=0.8969478011131287, pvalue=0.0)
## 
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1760: UserWarning: p-value may not be accurate for N > 5000.
##   warnings.warn("p-value may not be accurate for N > 5000.")
print('ShapiroTest of Amount',shapiro(df.Amount))
## ShapiroTest of Amount ShapiroResult(statistic=0.908415675163269, pvalue=0.0)
print('ShapiroTest of SaleExclGST',shapiro(df.SaleExclGST))
## ShapiroTest of SaleExclGST ShapiroResult(statistic=0.8960244059562683, pvalue=0.0)
print('ShapiroTest of GSTP',shapiro(df.GSTP))
## ShapiroTest of GSTP ShapiroResult(statistic=0.6320035457611084, pvalue=0.0)
print('ShapiroTest of GST',shapiro(df.GST))
## ShapiroTest of GST ShapiroResult(statistic=0.8931977152824402, pvalue=0.0)
print('ShapiroTest of BarcodeDiscPer',shapiro(df.BarcodeDiscPer))
## ShapiroTest of BarcodeDiscPer ShapiroResult(statistic=0.8688030242919922, pvalue=0.0)
print('ShapiroTest of BarcodeDiscount',shapiro(df.BarcodeDiscount))
## ShapiroTest of BarcodeDiscount ShapiroResult(statistic=0.8954393863677979, pvalue=0.0)
print('ShapiroTest of NetAmount',shapiro(df.NetAmount))
## ShapiroTest of NetAmount ShapiroResult(statistic=0.9133129715919495, pvalue=0.0)
print('ShapiroTest of PointsEarned',shapiro(df.PointsEarned))
## ShapiroTest of PointsEarned ShapiroResult(statistic=1.0, pvalue=1.0)
## 
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1757: UserWarning: Input data for shapiro has range zero. The results may not be accurate.
##   warnings.warn("Input data for shapiro has range zero. The results "

6.1 Normalization of the numeric columns.

6.1.1 Methods of Normalization

Four common normalization techniques may be useful:

  • scaling to a range(min-max)
  • clipping
  • log scaling
  • z-score

We use the standard scaler to normalize the data.


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #<<
df[['Qty','Price','Amount','SaleExclGST','GSTP','GST','BarcodeDiscPer','BarcodeDiscount','NetAmount','PointsEarned','DiscPer','DiscAmount']] = scaler.fit_transform(df[['Qty','Price','Amount','SaleExclGST','GSTP','GST','BarcodeDiscPer','BarcodeDiscount','NetAmount','PointsEarned','DiscPer','DiscAmount']])

Checking the normality after normalization

from scipy.stats import shapiro
print('ShapiroTest of Price',shapiro(df.Price))
## ShapiroTest of Price ShapiroResult(statistic=0.8945704102516174, pvalue=0.0)
## 
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1760: UserWarning: p-value may not be accurate for N > 5000.
##   warnings.warn("p-value may not be accurate for N > 5000.")
print('ShapiroTest of Amount',shapiro(df.Amount))
## ShapiroTest of Amount ShapiroResult(statistic=0.9033786058425903, pvalue=0.0)
print('ShapiroTest of SaleExclGST',shapiro(df.SaleExclGST))
## ShapiroTest of SaleExclGST ShapiroResult(statistic=0.8960385918617249, pvalue=0.0)
print('ShapiroTest of GSTP',shapiro(df.GSTP))
## ShapiroTest of GSTP ShapiroResult(statistic=0.6283426880836487, pvalue=0.0)
print('ShapiroTest of GST',shapiro(df.GST))
## ShapiroTest of GST ShapiroResult(statistic=0.8964444398880005, pvalue=0.0)
print('ShapiroTest of BarcodeDiscPer',shapiro(df.BarcodeDiscPer))
## ShapiroTest of BarcodeDiscPer ShapiroResult(statistic=0.8605928421020508, pvalue=0.0)
print('ShapiroTest of BarcodeDiscount',shapiro(df.BarcodeDiscount))
## ShapiroTest of BarcodeDiscount ShapiroResult(statistic=0.8955459594726562, pvalue=0.0)
print('ShapiroTest of NetAmount',shapiro(df.NetAmount))
## ShapiroTest of NetAmount ShapiroResult(statistic=0.9149541854858398, pvalue=0.0)
print('ShapiroTest of PointsEarned',shapiro(df.PointsEarned))
## ShapiroTest of PointsEarned ShapiroResult(statistic=1.0, pvalue=1.0)
## 
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1757: UserWarning: Input data for shapiro has range zero. The results may not be accurate.
##   warnings.warn("Input data for shapiro has range zero. The results "
print('ShapiroTest of DiscPer',shapiro(df.DiscPer))
## ShapiroTest of DiscPer ShapiroResult(statistic=1.0, pvalue=1.0)
print('ShapiroTest of DiscAmount',shapiro(df.DiscAmount))
## ShapiroTest of DiscAmount ShapiroResult(statistic=1.0, pvalue=1.0)

7 Applying Statistical test

7.1 1-way ANOVA Test on SaleExclGST for 3 regions

##                 Df    Sum Sq  Mean Sq F value   Pr(>F)    
## RegionName       2 9.830e+07 49152403   25.44 8.95e-12 ***
## Residuals   607200 1.173e+12  1932049                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We apply the TukeyHSD test to check for which region company have significant difference from other regions.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = SaleExclGST ~ RegionName, data = Regions)
## 
## $RegionName
##                         diff       lwr      upr     p adj
## 2-LAHORE-1-KARACHI 20.071480  9.550150 30.59281 0.0000232
## 3-NORTH -1-KARACHI 30.050739 19.299537 40.80194 0.0000000
## 3-NORTH -2-LAHORE   9.979259 -2.897429 22.85595 0.1641908

Results show that * Sales in Lahore is significantly different than sales in Karachi. * Sales in North is significantly different than sales in Lahore. * Sales in North and Karachi are not statistically significant from each other at 95% confidence interval.


8 Data Visualizations

image

image

8.1 Monthly revenue for the company

8.2 Buyers by Region

8.3 Co-brands in demand

image


9 Machine learning Model Building

ML has been applying in the capacity of Regression, Classification, Clustering and Deep Learning models. “Price” has been a dependent variable in every model except Classification. “Season” is used as a output variable in Classification. The concept of taking price, is to determine the customer behaviour with regards to price.

Dependent variables in various ML algorithms will be * Price for linear regression where other select features will be independent variable. * Price for multilenar regression where all other variables will be independent. * Inventory_status and LocalImport where all other are independent variables in classification ML model. * Price in Deep learning model * 2 Clustering Models for SeasonName.

9.1 Step_1: Encoding the categorical columns

  • We will use One_Hot encoding for 2 variables Attribute8 which is actually the class of the sales and Local Import
One_hot_encoded_data = df_Ml[['Offers']]

enc = OneHotEncoder()
enc_results = enc.fit_transform(One_hot_encoded_data)


enc=pd.DataFrame(enc_results.toarray(), columns=enc.categories_)

10 Classification Models

  • Target feature: Inventory_status & LocalImport
  • Independent features: All other features obrained after CV and feature selection

10.1 Local Import as target feature

10.1.1 Step 1: Subset of large dataset

  • We selected a sample of 10000 rows for comparison of different classification model.
df_M = df_Ml.sample(10000).reset_index(drop=True)

10.1.2 Step2 Test train split the data

  • Data is splitted before hand with 30% test data and 70% train data.
X=df_M.drop('LocalImport',axis=1)
y=df_M['LocalImport']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=44, shuffle =True)

10.1.3 Step 3: Make a list of classification models we want to apply

final_clf = None
clf_names = ["Logistic Regression", "KNN(3)", "XGBoost Classifier", "Random forest classifier", "Decision Tree Classifier",
            "Gradient Boosting Classifier", "Support Vector Machine"]

10.1.4 Step4: Apply each model on 10000 rows dataset and get the scores

classifiers = []
scores = []
for i in range(10):
    
    tempscores = []
    
    # logistic Regression
    lr_clf = LogisticRegression(n_jobs=-1)
    lr_clf.fit(X_train, y_train)
    tempscores.append((lr_clf.score(X_test, y_test))*100)
    
    # KNN n_neighbors = 3
    knn3_clf = KNeighborsClassifier(n_jobs=-1)
    knn3_clf.fit(X_train, y_train)
    tempscores.append((knn3_clf.score(X_test, y_test))*100)

    # XGBoost
    xgbc = XGBClassifier(n_jobs=-1,seed=41)
    xgbc.fit(X_train, y_train)
    tempscores.append((xgbc.score(X_test, y_test))*100)

    # Random Forest
    rf_clf = RandomForestClassifier(n_jobs=-1)
    rf_clf.fit(X_train, y_train)
    tempscores.append((rf_clf.score(X_test, y_test))*100)

    # Decision Tree
    dt_clf = DecisionTreeClassifier()
    dt_clf.fit(X_train, y_train)
    tempscores.append((dt_clf.score(X_test, y_test))*100)

    # Gradient Boosting 
    gb_clf = GradientBoostingClassifier()
    gb_clf.fit(X_train, y_train)
    tempscores.append((gb_clf.score(X_test, y_test))*100)
    
    #SVM
    svm_clf = SVC(gamma = "scale")
    svm_clf.fit(X_train, y_train)
    tempscores.append((svm_clf.score(X_test, y_test))*100)
    
    scores.append(tempscores)

scores = np.array(scores)
clfs = pd.DataFrame({"Classifier":clf_names})
for i in range(len(scores)):
    clfs['iteration' + str(i)] = scores[i].T

means = clfs.mean(axis = 1)
means = means.values.tolist()

clfs["Average"] = means


clfs.set_index("Classifier", inplace = True)
print("Accuracies : ")
clfs["Average"].head(10)

image

  • At this stage we have got the best classification with highest accuracy for getting prediction of LocalImport variable.

10.1.5 Step 5: Selecting full data for feature selection

X=df_M.drop('LocalImport',axis=1)
y=df_M['LocalImport']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=44, shuffle =True)

print('X_train shape is ' , X_train.shape)
print('X_test shape is ' , X_test.shape)
print('y_train shape is ' , y_train.shape)
print('y_test shape is ' , y_test.shape)
  • Now comes the best feature selection part. We will use RFE to select the best features for each model.

10.1.6 Step 6: Select and plot best features byfeatures_importance

model = GradientBoostingClassifier()
model.fit(X,y)
plt.style.use('ggplot')
plt.figure(figsize=(6,10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(50).plot(kind='barh')
plt.savefig("extra_tree.png",dpi=200)
plt.show()

image

Only 7 features are good enough out of 38 to train our final model. so we drop rest of the features below.

df_M = df_Ml[['BrandName','Attribute5','GSTP','SeasonName','BillMonth','Price','DesignNo','Import_type','LocalImport']]

We could also select Pearson’s corelation coefficient values

10.1.7 Step 7: Hyperparameters selection

Making a dictionary of the hyperparamters for Gradient boost model. We will apply grid cross validation approach to select best hyperparamters for our selected model.

#Creating Parameters
params = {
    'learning_rate':[0.1,1],
    'n_estimators':[5,9, 11,12,13,15,20,25,26,29,31,20,50,75,100],
    'max_features':['auto','sqrt','log2'],
    'criterion':['friedman_mse', 'squared_error', 'mse'],
    'loss':['log_loss', 'deviance', 'exponential']
}


#Fitting the model

from sklearn.model_selection import GridSearchCV

rf = GradientBoostingClassifier()
grid = GridSearchCV(rf, params, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
print("Accuracy:"+ str(grid.best_score_))

{‘criterion’: ‘friedman_mse’, ‘learning_rate’: 1, ‘loss’: ‘exponential’, ‘max_features’: ‘auto’, ‘n_estimators’: 75} Accuracy:1.0

  • It gives us the best hyperpatamters at scoring paramter chosen as accuracy.

  • Remember model will remain the same with similar X_train, y_train data.

10.1.8 Step 8: Final application of model

# applying model with best hyperparameters

rf = GradientBoostingClassifier(criterion='friedman_mse', learning_rate=1, loss='exponential', max_features='auto', n_estimators=75)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy Score: ", accuracy_score(y_test, y_pred))

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred)) 

Accuracy Score: 1.0 Confusion Matrix: [[ 9895 0] [ 0 191130]]

  • We have got 100% accuracy for our final model.

10.1.9 Step9: Model scores and evaluation

from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport

# Instantiate the classification model and visualizer

visualizer = ClassificationReport(rf)

visualizer.fit(X_train, y_train)  # Fit the visualizer and the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show() 

image

10.1.10 Step 10: Saving the model to get prediction later

import pickle
pkl_filename = "localImport_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(grid, file)

# Load from file
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

# Calculate the accuracy score and predict target values
score = pickle_model.score(X_test, y_test)


y_predict = pickle_model.predict(X_test)

11 Classification Models

11.1 Inventory_status as target feature.

Best Params: {‘criterion’: ‘entropy’, ‘max_features’: None, ‘n_estimators’: 75, ‘random_state’: 1} Train MSE: 0.0 Test MSE: 0.146

modelkn = RandomForestClassifier(random_state=1,criterion= 'entropy' ,n_jobs=-1, n_estimators=75, max_features=None) 

modelkn= modelkn.fit(X_train,y_train)

y_11 = modelkn.predict(X_test)

Model scores for inventory_status

image

Confusion matrix for inventory_status

image


12 Regression Models

12.1 Price and SaleExclGST as target features.

  • Target features: Price of article. & SaleExclGST gained by company +NetAmount paid by customer.

  • In one regression model there is only 1 target feature while in the other there are two target features.

  • Only difference from the previous approach is the use of RepeatedKFold cross validation method to get the best model.

  • Model evaluation is done by R2 score and negative RMSE score.

  • Again we use sample of 10000 rows to compare models followed by full sample to train the final model.

models = {}
models['lr']= LinearRegression()
models['dr'] = DecisionTreeRegressor()
models['rf'] = RandomForestRegressor()
models['kn']=  KNeighborsRegressor()
models['ad'] = AdaBoostRegressor()
models['ex'] = ExtraTreesRegressor()
models['sv'] = SVR()


from sklearn import model_selection

for model in models:
  cv = sklearn.model_selection.RepeatedKFold(n_splits=100,n_repeats=1,random_state=1)
  n_scores = model_selection.cross_val_score(models[model],X,y,scoring='neg_root_mean_squared_error',cv=cv,n_jobs=-1)
  print(model, np.mean(n_scores),np.std(n_scores))

lr -0.015952386710040633 0.023420996302700597 dr -0.0684330302042271 0.13595572185672142 rf -0.05608366484049382 0.13108548370117068 kn -0.953965008983964 0.20208368080610473 ex -0.056439850981039486 0.19471193286786767

We picked Random Forest regressor for further analysis.

12.1.1 Features selection by two methods: RFECV method

image

image

Best Hyperparameters for Price prediction are: {‘criterion’: ‘absolute_error’, ‘max_features’: ‘sqrt’, ‘n_estimators’: 11, ‘random_state’: 33} Train MSE: 0.01264030494634164 Test MSE: 0.003932264888039404

After putting 5 features and above hyperparameters we get model scores as Test MSE: 0.003932264888039404 Test RMSE: 0.001966132444019702

R-squared score (training): 0.986 R-squared score (test): 0.995

Price prediction

Revenue prediction


13 Regression Models

13.1 Price as target feature.

Regression Models comparison for prediciting price. Neg. RMSE score is used as evaluation metric.

image

RandomForestRegressor had the highest RMSE so we choose this model for further analysis.

optimum features selection for price prediction by running

image

image

Best hyperparameters for price prediction by GridSearchCV method

Best Params: {‘criterion’: ‘squared_error’, ‘max_features’: ‘sqrt’, ‘random_state’: 42, ‘splitter’: ‘best’} Train MSE: 1.331754172731566e-06 Test MSE: 0.002528635026436538

modelkn = DecisionTreeRegressor(criterion= 'squared_error', max_features= 'sqrt', random_state= 42, splitter= 'best') 

modelkn= modelkn.fit(X_train,y_train)

y_11 = modelkn.predict(X_test)

In the we get R2 score of 1

image

and other score are given below

R-squared score (training): 1.000 R-squared score (test): 0.997

Test MSE: 0.002528635026436538 Test RMSE: 0.001264317513218269


14 Clustering Models

We will select the variable seasonName. Finding the best number of clusters using elbow method and silhouette score.

14.1 Finding best number of clusters

14.1.1 Silhoutte method

X = df_M.drop(["SeasonName"], axis = 1)
y = df_M[["SeasonName"]]

fig, ax = plt.subplots(3, 2, figsize=(15,8))
for i in [2, 3, 4, 5, 6, 7]:
    '''
    Create KMeans instance for different number of clusters
    '''
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    q, mod = divmod(i, 2)
    '''
    Create SilhouetteVisualizer instance with KMeans instance
    Fit the visualizer
    '''
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(X)

image

14.1.2 Elbow method

inertias = []
for n_clusters in range(2, 15):
 km = KMeans(n_clusters=n_clusters).fit(df_Ml)
 inertias.append(km.inertia_)
 
plt.plot(range(2, 15), inertias, ‘k’)
plt.title(“Inertia vs Number of Clusters”)
plt.xlabel(“Number of clusters”)
plt.ylabel(“Inertia”)

14.2 image

15 Clustering Models

15.1 Prinicpal Component Analysis

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
X = df_M.drop(["SeasonName"], axis = 1)
y = df_M[["SeasonName"]]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

pca = PCA(n_components=2)

projected=pca.fit_transform(X_train)
pca.transform(X_test)
## array([[ 8.90100050e+04, -9.42476131e+01],
##        [ 1.79653719e+05, -9.48771973e+02],
##        [-5.53152733e+04, -1.07599693e+03],
##        ...,
##        [ 1.03830699e+05, -9.38314582e+02],
##        [ 1.95697285e+05, -4.29913757e+02],
##        [ 1.49871216e+05,  9.19070541e+02]])
pca.explained_variance_ratio_
## array([9.99660603e-01, 1.68253481e-04])
plt.scatter(projected[:, 0], projected[:, 1],
            c=projected[:,0], edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('CMRmap', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x7fbfaece4a00>

15.2 DBSCAN Clustering

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.4, min_samples=20)
db.fit(X)
## DBSCAN(eps=0.4, min_samples=20)
y_pred = db.fit_predict(X)
plt.figure(figsize=(10,6))
plt.scatter(X['SaleExclGST'], y['SeasonName'],c=y_pred, cmap='Paired')
plt.title("Clusters determined by DBSCAN")


16 Conclusion

  • Cambridge Industries makes a database of their sales.

  • All the information related to customer choices such as loyalty card, product exchanges and type of product bought is available is the database.

  • We have performed EDA analysis followed by data cleaning and feature engineering.

  • Statistical aanalysis show that there is significant difference between sales offline and online sales offered by the company.

  • We have taken subset of data based on different products and locations. A map was shown where the products are bought and delivered.

  • We have used different machine learning models to predict the price of the product such regression, classification, clustering and deep neural network with output features Sales (revenue of the company), inventory_status, SeasonName and Price respectively.

  • In the end we conclude that Company should focus on the products which are in high demand and are in stock.